4  Tidy & Transform

Code
import os, json, inspect
from pathlib import Path
from collections import defaultdict

import numpy as np
import pandas as pd
import networkx as nx
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import StackingClassifier
from sklearn.metrics import roc_auc_score, accuracy_score, brier_score_loss
from xgboost import XGBClassifier
import joblib

# ---------- config ----------
BASE_DIR  = Path("/Users/yifanw124/STAT468/stat468-final-project")
DATA_PATH = BASE_DIR / "tournaments_2018_2025_June.csv"
OUT_DIR   = BASE_DIR
OUT_MODEL = OUT_DIR / "stack_model.joblib"
OUT_META  = OUT_DIR / "feature_spec.json"

PIN_TO_S3          = os.getenv("PIN_TO_S3", "false").lower() == "true"
USE_VETIVER_BUNDLE = os.getenv("USE_VETIVER", "false").lower() == "true"
RANDOM_STATE       = 42

MODEL_BUCKET = os.getenv("MODEL_BUCKET", "")          # used only if PIN_TO_S3
MODEL_PIN    = os.getenv("MODEL_PIN", "stack_model")  # also used as vetiver model_name

4.1 Tidy

Code
# ---------- load ----------
df0 = pd.read_csv(DATA_PATH)
df0 = df0[df0["event"].str.contains("MS|WS", regex=True)].copy()
df0["date"] = pd.to_datetime(df0["date"])
# stable sort: matches played on the same date keep their original file order
df0 = df0.sort_values("date", kind="mergesort").reset_index(drop=True)

In preparing the dataset for analysis, we began by importing the raw tournament records from the specified source file into a working DataFrame. From this full set of events, we retained only men’s singles (MS) and women’s singles (WS) matches, since our focus is on head-to-head singles competition. The match date field was converted to a proper datetime format to enable accurate time-based feature construction and ordering. We then sorted the data chronologically so that all subsequent feature calculations respect the temporal sequence of matches, and reset the index to maintain a clean, continuous row numbering for downstream processing.

4.2 Feature Engineering

The predictive model relies on a set of engineered features designed to capture player skill, recent performance trends, head-to-head dynamics, and network-level influence. Each feature was constructed to reflect only information available at the time of prediction, thereby avoiding target leakage. The following describes each feature, its derivation, and its intended role in improving model performance.

4.3 Elo Ratings

Code
# ---------- Elo (online, no leakage) ----------
DEFAULT_ELO = 1200
K = 32

from collections import defaultdict
elo = defaultdict(lambda: DEFAULT_ELO)

def expected_score(rA, rB):
    return 1 / (1 + 10 ** ((rB - rA) / 400))

def update_elo(rA, rB, outcome_A):
    eA = expected_score(rA, rB)
    rA_new = rA + K * (outcome_A - eA)
    rB_new = rB + K * ((1 - outcome_A) - (1 - eA))
    return rA_new, rB_new

rows = []
for _, r in df0.iterrows():
    p1, p2 = str(r["player1"]), str(r["player2"])
    out1 = 1 if int(r["winner"]) == 1 else 0
    r1, r2 = elo[p1], elo[p2]
    sd = float(r["team1_total_points"] - r["team2_total_points"])

    # features BEFORE updating Elo to avoid leakage
    rows.append({
        "player_id": p1, "opponent_id": p2,
        "elo_player": r1, "elo_opponent": r2,
        "elo_diff": r1 - r2,
        "score_diff": sd,
        "win": out1,
        "date": r["date"],
        "tournament": r.get("tournament_name", None),
        "event": r["event"],
    })
    rows.append({
        "player_id": p2, "opponent_id": p1,
        "elo_player": r2, "elo_opponent": r1,
        "elo_diff": r2 - r1,
        "score_diff": -sd,
        "win": 1 - out1,
        "date": r["date"],
        "tournament": r.get("tournament_name", None),
        "event": r["event"],
    })

    elo[p1], elo[p2] = update_elo(r1, r2, out1)

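# stable sort: rows sharing a date keep the order in which the matches were processed,
# so the shifted per-player features below are deterministic
df = pd.DataFrame(rows).sort_values("date", kind="mergesort").reset_index(drop=True)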

For each match, the pre-match Elo ratings of both the focal player and their opponent were recorded. Elo ratings were initialized at 1200 for all players and updated online using a constant K=32 after each match. By recording these ratings before any update, the features represent each player’s latent skill level immediately prior to the match without incorporating outcome information. This choice captures the evolving competitive balance between players and is well suited to sports prediction contexts where performance changes over time.
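As a quick arithmetic check of the update rule described above, the sketch below reuses the same two functions on invented ratings. The specific ratings are illustrative assumptions, not values from the dataset.

```python
# Illustrative check of the Elo update; the ratings used here are made up.
DEFAULT_ELO = 1200
K = 32

def expected_score(rA, rB):
    # Probability that A beats B under the Elo model
    return 1 / (1 + 10 ** ((rB - rA) / 400))

def update_elo(rA, rB, outcome_A):
    # outcome_A is 1 if A won, 0 if A lost
    eA = expected_score(rA, rB)
    rA_new = rA + K * (outcome_A - eA)
    rB_new = rB + K * ((1 - outcome_A) - (1 - eA))
    return rA_new, rB_new

# Two newcomers: the expected score is 0.5, so the winner gains K/2 = 16 points.
new_a, new_b = update_elo(DEFAULT_ELO, DEFAULT_ELO, 1)
print(new_a, new_b)  # 1216.0 1184.0

# An upset: a 1200-rated player beats a 1400-rated player and gains about 24.3 points.
up_a, up_b = update_elo(1200, 1400, 1)
print(round(up_a, 2), round(up_b, 2))
```

The update is zero-sum: whatever one player gains, the other loses, so the total rating across both players is unchanged by a match.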

The Elo rating difference, computed as elo_player - elo_opponent, distills the relative strength of the two competitors into a single scalar measure. Positive values indicate that the focal player entered the match as the Elo favourite, while negative values suggest the opponent held an advantage. Using a difference metric rather than two separate ratings reduces collinearity and can improve interpretability for models sensitive to redundant inputs.
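To give a sense of scale for the difference feature, the following sketch converts a few rating gaps into the win probability implied by the Elo formula itself. This is the Elo model's own mapping, not the output of the fitted classifier, and the gaps shown are arbitrary illustrative values.

```python
def expected_score(rA, rB):
    # Elo-implied probability that A beats B
    return 1 / (1 + 10 ** ((rB - rA) / 400))

# The implied probability depends only on the difference, not on the rating level.
same_gap_low = expected_score(1400, 1200)
same_gap_high = expected_score(1600, 1400)

# Implied win probability for a few illustrative rating gaps
implied = {diff: expected_score(1200 + diff, 1200) for diff in (0, 100, 200, 400)}
for diff, p in implied.items():
    print(f"elo_diff = {diff:>3}: implied win probability = {p:.3f}")
```

A gap of 400 points corresponds to implied odds of 10 to 1, which follows directly from the base-10, 400-point scaling in the formula.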

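The score differential is the raw margin of points in a match, computed as team1_total_points - team2_total_points and negated for the opponent's perspective row. Unlike the Elo features, it is an outcome of the match in question: it is known only once the match has been played and is strongly tied to the win label. It therefore cannot serve as a same-row predictor without leaking the target. To describe the dominance or closeness of prior performances, it must first be lagged in the same way as the rolling win percentages in Section 4.4, at which point it can reveal patterns not visible from binary win/loss records alone.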
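Since the score differential is only observed once a match has finished, one way to use it for prediction is to lag it per player so that each row sees only earlier margins. The sketch below shows this on an invented toy frame; the column name score_diff_prev5 and the window of 5 are hypothetical choices for illustration and are not part of the original pipeline.

```python
import pandas as pd

# Toy long-format frame: one row per player per match (all values are invented).
toy = pd.DataFrame({
    "player_id":  ["A", "A", "A", "B", "B"],
    "score_diff": [6.0, -4.0, 10.0, -6.0, 3.0],
})

# Hypothetical feature: mean margin over up to 5 PREVIOUS matches.
# shift(1) drops the current match, so the row never sees its own margin.
toy["score_diff_prev5"] = (
    toy.groupby("player_id")["score_diff"]
       .transform(lambda s: s.shift(1).rolling(5, min_periods=1).mean())
)
print(toy)
```

Each player's first match has no earlier margin, so the lagged value there is NaN and needs an explicit imputation choice before modelling.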

4.4 Rolling Win %

Code
# ---------- Rolling win% (shifted) ----------
for w in (5, 10, 20):
    df[f"win_pct_{w}"] = (
        df.groupby("player_id")["win"]
          .transform(lambda s: s.shift(1).rolling(w, min_periods=1).mean())
    )

Three rolling win percentage features were calculated over the most recent 5, 10, and 20 matches, each shifted by one match to exclude the current outcome. Because min_periods=1, a player's early matches use however many prior results exist, and a player's first match has no value (NaN). The use of multiple window sizes allows the model to detect both short-term momentum and longer-term consistency. Smaller windows respond quickly to form changes, while larger windows smooth short-term volatility and approximate a player’s baseline performance level.
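The effect of the shift and of the window size can be seen on a small invented record. The sketch below uses windows of 2 and 5 purely to keep the toy example short; the report itself uses 5, 10, and 20.

```python
import pandas as pd

# One player who wins three matches and then loses three (invented data).
toy = pd.DataFrame({"player_id": ["A"] * 6, "win": [1, 1, 1, 0, 0, 0]})

for w in (2, 5):  # illustrative windows only
    toy[f"win_pct_{w}"] = (
        toy.groupby("player_id")["win"]
           .transform(lambda s: s.shift(1).rolling(w, min_periods=1).mean())
    )
print(toy)
# win_pct_2: NaN, 1.0, 1.0, 1.0, 0.5,  0.0
# win_pct_5: NaN, 1.0, 1.0, 1.0, 0.75, 0.6
```

The short window has dropped to 0.0 by the sixth match, while the longer window still reflects the earlier winning streak.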

4.5 Head-to-Head Exponential Decay

Code
# ---------- H2H exponential decay (shifted) ----------
alpha = 0.1
df["h2h_decay"] = (
    df.groupby(["player_id", "opponent_id"])["win"]
      .transform(lambda s: s.shift(1).ewm(alpha=alpha, adjust=False).mean())
)

# Opponent strength adjust (safe divide)
df["h2h_adj"] = (
    df["h2h_decay"] * (df["elo_opponent"] / df["elo_player"].replace(0, np.nan))
).fillna(0.0)

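An exponentially weighted moving average of prior head-to-head results was computed for each player–opponent pair, with smoothing parameter α=0.1 and a one-match shift so that the current result is excluded. With adjust=False the average follows the recursion h_t = (1 - α) h_{t-1} + α x_t, initialized at the first recorded meeting, so each new result moves the value by only 10%, and the feature is undefined (NaN) until the pair has met once. Older results decay geometrically, but for pairs with only a few meetings the first result still carries most of the weight. The feature captures stylistic or matchup-specific dynamics that are not fully explained by overall skill ratings, reflecting the idea that certain players may consistently perform better or worse against particular opponents.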
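To make the behaviour of the decayed average concrete, the sketch below applies the same shift-then-ewm call to a single invented head-to-head history. The win sequence is an assumption chosen for illustration.

```python
import pandas as pd

alpha = 0.1

# Invented history for one player-opponent pair: win, loss, loss, win.
wins = pd.Series([1, 0, 0, 1])

# Same construction as h2h_decay: shift(1) hides the current result,
# adjust=False gives h_t = (1 - alpha) * h_{t-1} + alpha * x_t.
h2h = wins.shift(1).ewm(alpha=alpha, adjust=False).mean()
print(h2h.tolist())  # NaN, then approximately 1.0, 0.9, 0.81
```

Before the fourth meeting the value is still about 0.81 even though the two most recent results were losses, which shows how slowly the feature moves when α is small and the history is short.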

The head-to-head decay feature was further adjusted by the relative strength of the opponent, computed as the ratio of the opponent’s pre-match Elo to the player’s pre-match Elo for the current match. This scaling accounts for the fact that a favourable record against a strong opponent is more informative than the same record against a weaker one: the value is scaled up when the opponent is rated higher than the player and scaled down when the opponent is rated lower. Pairs with no prior meeting, for which the decayed average is undefined, are assigned an adjusted value of 0.
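The adjustment can be checked by hand on a few invented rows. In the sketch below the ratings and decayed values are assumptions chosen to show the three cases: no prior meeting, a stronger opponent, and a weaker opponent.

```python
import numpy as np
import pandas as pd

# Invented rows: no prior meeting, stronger opponent, weaker opponent.
toy = pd.DataFrame({
    "h2h_decay":    [np.nan, 0.5, 0.5],
    "elo_player":   [1200.0, 1200.0, 1300.0],
    "elo_opponent": [1200.0, 1300.0, 1200.0],
})

# Same expression as h2h_adj: scale by the Elo ratio, then fill missing values with 0.
toy["h2h_adj"] = (
    toy["h2h_decay"] * (toy["elo_opponent"] / toy["elo_player"].replace(0, np.nan))
).fillna(0.0)
print(toy)
```

Note that the filled value of 0 for pairs without history is numerically the same as a record of consistent losses, so the model cannot distinguish the two cases from this column alone.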

4.6 Time-based Split

Code
# ---------- Time-based split ----------
date_cut = df["date"].quantile(0.80)
df_tr = df[df["date"] <= date_cut].copy()
df_te = df[df["date"] >  date_cut].copy()

4.7 PageRank

Code
# ---------- PageRank (train period only) ----------
G = nx.DiGraph()
for _, rr in df_tr.iterrows():
    if rr["win"] == 1:
        G.add_edge(rr["opponent_id"], rr["player_id"])
pagerank = nx.pagerank(G, alpha=0.85) if G.number_of_nodes() > 0 else {}

df["pr_player"]   = df["player_id"].map(lambda x: pagerank.get(x, 0.0)).astype(float)
df["pr_opponent"] = df["opponent_id"].map(lambda x: pagerank.get(x, 0.0)).astype(float)

# Re-split after PR
df_tr = df[df["date"] <= date_cut].copy()
df_te = df[df["date"] >  date_cut].copy()

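PageRank scores were calculated on a directed graph built from training-period matches only, with an edge from each loser to the player who beat them. This network-based metric rewards victories over players who have themselves beaten strong opponents, thereby encoding strength-of-schedule information. Because the graph is unweighted, repeated wins over the same opponent count once, and players who do not appear in the training period receive a score of 0. Restricting the graph to training data keeps test-period results out of the feature. Within the training period, however, every row receives the score from the full training graph, which includes that row's own match and later training matches.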
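The loser-to-winner construction can be illustrated on a three-player toy graph. The sketch below is a hand-rolled power iteration written for illustration only; it is not the networkx implementation used in the report, and the function name pagerank_sketch and the match results are hypothetical.

```python
def pagerank_sketch(edges, alpha=0.85, iters=100):
    """Simplified PageRank by power iteration on (loser, winner) edges."""
    nodes = sorted({n for edge in edges for n in edge})
    out = {n: [] for n in nodes}
    for loser, winner in edges:
        if winner not in out[loser]:
            out[loser].append(winner)  # repeated results collapse to one edge
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iters):
        # Players with no losses have no outgoing edges; spread their mass uniformly.
        dangling = sum(rank[v] for v in nodes if not out[v])
        new = {v: (1 - alpha) / n + alpha * dangling / n for v in nodes}
        for u in nodes:
            for v in out[u]:
                new[v] += alpha * rank[u] / len(out[u])
        rank = new
    return rank

# Invented results: A beat B, B beat C, A beat C (edges point loser -> winner).
edges = [("B", "A"), ("C", "B"), ("C", "A")]
scores = pagerank_sketch(edges)
print(sorted(scores, key=scores.get, reverse=True))  # ['A', 'B', 'C']
```

Player A ranks first because score flows in from both B and C, and B ranks above C because B has a win while C has none.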